================================================================================
This is an exploration of 2016 US presidential election donations in the state of California. For this data analysis, I am exploring the 2016 presidential campaign finance data from the Federal Election Commission. The dataset contains financial contribution transactions.
Through my analysis, I will attempt to answer the following questions:
# Load all of the packages that will be used for analysis
library(readr)
library(ggplot2)
library(dplyr)
library(tidyr)
library(lubridate)
library(gridExtra)
library(plotly)
library(ggmap)
library(maps)
library(tidyverse)
Summarise the dataset, and check column names.
## cmte_id cand_id cand_nm
## Length:1304346 Length:1304346 Length:1304346
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## contbr_nm contbr_city contbr_st
## Length:1304346 Length:1304346 Length:1304346
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## contbr_zip contbr_employer contbr_occupation
## Min. : 0 Length:1304346 Length:1304346
## 1st Qu.:910162321 Class :character Class :character
## Median :930012752 Mode :character Mode :character
## Mean :850773217
## 3rd Qu.:945981502
## Max. :961628693
## NA's :113
## contb_receipt_amt contb_receipt_dt receipt_desc
## Min. :-10500.0 Length:1304346 Length:1304346
## 1st Qu.: 15.0 Class :character Class :character
## Median : 27.0 Mode :character Mode :character
## Mean : 116.2
## 3rd Qu.: 88.0
## Max. : 10800.0
##
## memo_cd memo_text form_tp
## Length:1304346 Length:1304346 Length:1304346
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## file_num tran_id election_tp
## Min. :1003942 Length:1304346 Length:1304346
## 1st Qu.:1077916 Class :character Class :character
## Median :1099613 Mode :character Mode :character
## Mean :1102796
## 3rd Qu.:1133832
## Max. :1146285
##
## [1] "cmte_id" "cand_id" "cand_nm"
## [4] "contbr_nm" "contbr_city" "contbr_st"
## [7] "contbr_zip" "contbr_employer" "contbr_occupation"
## [10] "contb_receipt_amt" "contb_receipt_dt" "receipt_desc"
## [13] "memo_cd" "memo_text" "form_tp"
## [16] "file_num" "tran_id" "election_tp"
From the dataset summary we found that this dataset contains 1304346 observations and 18 variables. Let’s plot contribution graphs against different variables.
Let’s do some basic null checks, before we proceed.
There are no null values in the contribution amount column. Let’s plot a simple histogram.
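The null check and the simple histogram could be sketched as follows. This is a minimal illustration on a made-up toy data frame rather than the full `DF`; the bin width and axis labels are assumptions, not the settings used for the plots in this report.

```r
library(ggplot2)

# Toy stand-in for the full data frame (DF in this report); values are made up.
toy <- data.frame(contb_receipt_amt = c(15, 27, 88, 2700, -40, 500, 1000, 25))

# Count missing contribution amounts; 0 means the column has no nulls.
n_missing_amt <- sum(is.na(toy$contb_receipt_amt))
print(n_missing_amt)

# Simple histogram of the contribution amount; binwidth = 100 is an arbitrary choice.
amt_hist <- ggplot(toy, aes(x = contb_receipt_amt)) +
  geom_histogram(binwidth = 100) +
  labs(x = "Contribution amount (USD)", y = "Number of contributions")
```

Printing `amt_hist` draws the plot. A narrower bin width would be one way to get the more detailed view of the low-amount range.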
Plot another detailed contribution histogram.
This graph also shows that a large number of people contributed less than $500. A noticeable number of contributions also fall between $2,500 and $3,000.
The histogram above shows that the contribution amount also takes negative values. Let’s see whether these records carry any other relevant details by checking their receipt description field.
## # A tibble: 16,313 x 9
## cand_nm contbr… contb… contbr… contb_… cont… rece… memo… elec…
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 Trump, Donald J. ROPPA,… ALISO… RETIRED -40.0 21-A… <NA> <NA> G2016
## 2 Trump, Donald J. SHARP,… CARLS… RETIRED - 4.00 06-S… <NA> <NA> G2016
## 3 Trump, Donald J. SHARP,… CARLS… RETIRED - 4.00 06-O… <NA> <NA> G2016
## 4 Trump, Donald J. SHARP,… CARLS… RETIRED -28.0 18-O… <NA> <NA> G2016
## 5 Trump, Donald J. SHARP,… CARLS… RETIRED -28.0 25-O… <NA> <NA> G2016
## 6 Trump, Donald J. PAVLOV… CAMAR… UNEMPL… -40.0 26-A… <NA> <NA> G2016
## 7 Trump, Donald J. SHARP,… CARLS… RETIRED -28.0 01-N… <NA> <NA> G2016
## 8 Trump, Donald J. SHARP,… CARLS… RETIRED - 4.00 06-N… <NA> <NA> G2016
## 9 Trump, Donald J. SHAW, … POMONA RETIRED -20.0 16-O… <NA> <NA> G2016
## 10 Trump, Donald J. SHAW, … POMONA RETIRED -20.0 23-O… <NA> <NA> G2016
## # ... with 16,303 more rows
## [1] 10412
## [1] 30
Many of these records (10,412 out of 16,313) have “Refund” as their receipt description. It could be that the contributor changed their mind and asked for a refund at a later stage. The other two description categories are “Redesignation” and “Reattribution”. Let’s see how significant the sum of the negative amounts is relative to the total.
Let’s calculate the total sum of the positive and the negative values.
Check the number of rows containing a positive amount and the number of rows containing a negative amount.
So out of 1,304,346 observations, 16,313 have a negative amount.
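One way to get these counts and sums is to label each row by the sign of its amount and summarise per label. The sketch below uses a small made-up tibble; the `sign` column and the `amt_split` name are hypothetical helpers, and a zero amount would be counted as positive here.

```r
library(dplyr)

# Toy stand-in for DF; amounts and descriptions are made up.
toy <- tibble(
  contb_receipt_amt = c(50, 200, -40, 5, -28, 2700),
  receipt_desc = c(NA, NA, "Refund", NA, "REDESIGNATION TO GENERAL", NA)
)

# Label each row by the sign of the amount, then count rows and sum amounts per label.
amt_split <- toy %>%
  mutate(sign = if_else(contb_receipt_amt < 0, "negative", "positive")) %>%
  group_by(sign) %>%
  summarise(n = n(), total = sum(contb_receipt_amt))
print(amt_split)
```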
Let’s see how contributions were made over time. We will plot a time series line chart for each party to see the trend of contributions received towards the beginning or the end of the election period.
Let’s do a basic null check on date before we proceed
## [1] 0
No rows are missing the date field. Check how many unique dates are present.
# Count the distinct contribution dates
count_of_contb_dt <- length(unique(DF$contb_receipt_dt))
print(count_of_contb_dt)
## [1] 732
So the contributions are spread across more than 2 years. Let’s group them by month.
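A minimal sketch of the monthly grouping is shown below. It assumes the dates are stored as day-month-year strings such as "26-APR-16" (the printed output truncates them, so the exact format is an assumption) and parses them with `lubridate::dmy()`. The `receipt_date` helper column and the "%Y-%m" label format are illustrative choices.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for DF; the date strings follow the assumed dd-MON-yy layout.
toy <- tibble(
  contb_receipt_dt = c("26-APR-16", "20-APR-16", "04-MAR-16", "06-SEP-15"),
  contb_receipt_amt = c(50, 200, 40, 100)
)

# Parse the date, derive a year-month label, then total the contributions per month.
monthly <- toy %>%
  mutate(receipt_date = dmy(contb_receipt_dt),
         Month_Yr = format(receipt_date, "%Y-%m")) %>%
  group_by(Month_Yr) %>%
  summarise(total_amt = sum(contb_receipt_amt), n = n()) %>%
  arrange(desc(total_amt))
print(monthly)
```

Sorting by `total_amt` puts the peak month first; sorting by `Month_Yr` instead gives the chronological order needed for a time series plot.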
The top 10 dates of contribution show Dec 2015 being the peak month of contribution, followed by Sep 2015.
Let’s visualize through a graph.
In the univariate section we explored the “Contribution Amount” variable. Most contributions fell between 0 and 500 dollars, with slight peaks at 1,000 and 2,700 dollars. Most contribution values were positive, but some observations had negative values. The negative values were described as Refund, Reattribution or Redesignation. For the purpose of the current exploration, the total amount is calculated as the sum of the positive amounts less the negative (refunded) amounts.
There are 1,304,346 contributions and 18 variables. The variables that interest me, and that I have used, are:
cand_nm: Candidate Name
contb_receipt_amt: Contribution Amount
contbr_occupation: Contributor Occupation
contbr_city: Contributor City
contb_receipt_dt: Contribution date
election_tp: Type of election (Primary, General)
Other observations:
Most people contribute a small amount of money. The median contribution amount is $27 and the mean is $116. The amount contributed is highest in Aug-Sep 2016, just before the election.
From the above graph we see that contributions peaked between Aug 2016 and Oct 2016, just before the election. We can also note that contributions started to pick up from Apr 2015.
Next we will see how this total “contribution amount” is distributed w.r.t other factors like candidates, political parties, contributor’s occupation and contributor’s city.
Get the unique candidate names to see how many candidates stood for election. Summarise the total contribution amount for each candidate.
# Count the distinct candidate names
count_of_nm <- length(unique(DF$cand_nm))
print(count_of_nm)
## [1] 25
print(unique(DF$cand_nm))
## [1] "Clinton, Hillary Rodham" "Trump, Donald J."
## [3] "Sanders, Bernard" "O'Malley, Martin Joseph"
## [5] "Santorum, Richard J." "Cruz, Rafael Edward 'Ted'"
## [7] "Walker, Scott" "Bush, Jeb"
## [9] "Rubio, Marco" "Kasich, John R."
## [11] "Christie, Christopher J." "Johnson, Gary"
## [13] "Paul, Rand" "Webb, James Henry Jr."
## [15] "Carson, Benjamin S." "Fiorina, Carly"
## [17] "Jindal, Bobby" "Huckabee, Mike"
## [19] "Lessig, Lawrence" "Graham, Lindsey O."
## [21] "Pataki, George E." "Stein, Jill"
## [23] "Perry, James R. (Rick)" "McMullin, Evan"
## [25] "Gilmore, James S III"
We see that there are 25 unique candidates. Plot a bar chart to see how much was contributed to each candidate.
DF_cand_dist <- DF %>%
group_by(cand_nm) %>%
summarise(candidate_amt = sum(contb_receipt_amt, na.rm=TRUE),
n= n())
DF_cand_dist
## # A tibble: 25 x 3
## cand_nm candidate_amt n
## <chr> <dbl> <int>
## 1 Bush, Jeb 3300292 3130
## 2 Carson, Benjamin S. 2912555 27370
## 3 Christie, Christopher J. 456066 333
## 4 Clinton, Hillary Rodham 93681171 688524
## 5 Cruz, Rafael Edward 'Ted' 5730682 57822
## 6 Fiorina, Carly 1450689 4706
## 7 Gilmore, James S III 8100 3
## 8 Graham, Lindsey O. 414495 347
## 9 Huckabee, Mike 230891 531
## 10 Jindal, Bobby 23231 31
## # ... with 15 more rows
DF_cand_dist <- head(arrange(DF_cand_dist, desc(candidate_amt)), n= 10)
Let’s plot the graph.
Check political party wise contribution.
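The raw data has no party column, so a `Political_Party` column has to be attached before grouping by party. A sketch of one way to do this with a lookup table and a join is given below. The lookup covers only five candidates for illustration, the party labels are spelled exactly as they appear in the output of this report, and the `party_lookup` and `toy_with_party` names are hypothetical.

```r
library(dplyr)

# Illustrative lookup for a few candidates only; a full table would list all 25.
# Labels are spelled as they appear in this report's output.
party_lookup <- tibble(
  cand_nm = c("Clinton, Hillary Rodham", "Sanders, Bernard", "Trump, Donald J.",
              "Johnson, Gary", "Stein, Jill"),
  Political_Party = c("Democratic_Party", "Democratic_Party", "Republic_Party",
                      "Libretarian_Party", "Green_Party_of_USA")
)

# Toy stand-in for DF; amounts are made up.
toy <- tibble(
  cand_nm = c("Clinton, Hillary Rodham", "Trump, Donald J.", "Stein, Jill",
              "Sanders, Bernard"),
  contb_receipt_amt = c(200, 100, 25, 27)
)

# Attach the party to every contribution row by candidate name.
toy_with_party <- left_join(toy, party_lookup, by = "cand_nm")
print(toy_with_party)
```

A candidate missing from the lookup would get `NA` in `Political_Party`, which makes gaps in the table easy to spot.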
Draw pie chart to see distribution of contribution amount received by each political party.
First sum the contribution party wise.
DF_party_dist <- DF %>%
group_by(Political_Party) %>%
summarise(party_amt = sum(contb_receipt_amt, na.rm=TRUE))
head(DF_party_dist)
## # A tibble: 4 x 2
## Political_Party party_amt
## <chr> <dbl>
## 1 Democratic_Party 113491148
## 2 Green_Party_of_USA 751785
## 3 Libretarian_Party 495231
## 4 Republic_Party 36865654
Let’s see how many election types there are.
# Count the distinct election types (NA counts as one value)
count_of_election_tp <- length(unique(DF$election_tp))
print(count_of_election_tp)
## [1] 5
print(unique(DF$election_tp))
## [1] "P2016" "G2016" NA "P2020" "O2016"
We see that there are 5 distinct election type values, one of which is NA. Let’s see the contributions per election type.
DF_election_tp <- DF %>%
group_by(election_tp) %>%
summarise(sum_election_tp = sum(contb_receipt_amt, na.rm=TRUE),
mean_election_tp = mean(contb_receipt_amt),
n = n())
DF_election_tp <- head(arrange(DF_election_tp,desc(sum_election_tp)))
head(DF_election_tp)
## # A tibble: 5 x 4
## election_tp sum_election_tp mean_election_tp n
## <chr> <dbl> <dbl> <int>
## 1 P2016 93965415 115 818021
## 2 G2016 56931973 118 483991
## 3 O2016 453994 718 632
## 4 <NA> 237435 140 1695
## 5 P2020 15000 2143 7
Most of the contributions are for election type P2016, which presumably denotes the 2016 primary election. The next largest type is G2016, presumably the 2016 general election. Source: Wikipedia.
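The primary-election breakdown per candidate can be sketched as a filter on `election_tp` followed by a grouped summary. The example runs on a made-up tibble; the output column names mirror those in the tables of this report, and keeping the top six rows is an assumption based on the size of the table shown.

```r
library(dplyr)

# Toy stand-in for DF; values are made up.
toy <- tibble(
  cand_nm = c("Clinton, Hillary Rodham", "Clinton, Hillary Rodham",
              "Sanders, Bernard", "Sanders, Bernard", "Trump, Donald J."),
  contb_receipt_amt = c(50, 200, 40, 35, 100),
  election_tp = c("P2016", "P2016", "P2016", "G2016", NA)
)

# Keep primary-election rows only (filter() also drops rows where election_tp is NA),
# then summarise per candidate and keep the six largest totals.
primary_by_cand <- toy %>%
  filter(election_tp == "P2016") %>%
  group_by(cand_nm) %>%
  summarise(sum_election_tp_p2016 = sum(contb_receipt_amt),
            mean_election_tp = mean(contb_receipt_amt),
            n = n()) %>%
  arrange(desc(sum_election_tp_p2016)) %>%
  head(6)
print(primary_by_cand)
```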
## # A tibble: 818,021 x 9
## cand_nm contbr… contbr… contbr_… contb_r… contb… rece… memo_text elec…
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 Clinton… AULL, … LARKSP… RETIRED 50.0 26-AP… <NA> * HILLAR… P2016
## 2 Clinton… CARROL… CAMBRIA RETIRED 200 20-AP… <NA> * HILLAR… P2016
## 3 Clinton… GANDAR… FONTANA RETIRED 5.00 02-AP… <NA> * HILLAR… P2016
## 4 Sanders… LEE, A… CAMARI… SOFTWAR… 40.0 04-MA… <NA> * EARMAR… P2016
## 5 Sanders… LEONEL… REDOND… PHARMAC… 35.0 05-MA… <NA> * EARMAR… P2016
## 6 Sanders… LEONEL… REDOND… PHARMAC… 100 06-MA… <NA> * EARMAR… P2016
## 7 Sanders… LEOPAR… VISTA PROJECT… 25.0 04-MA… <NA> * EARMAR… P2016
## 8 Clinton… HOFER,… LAGUNA… RETIRED 40.0 20-AP… <NA> * HILLAR… P2016
## 9 Sanders… LEPKE,… WESTMI… NOT EMP… 10.0 05-MA… <NA> * EARMAR… P2016
## 10 Sanders… LERCH,… PETALU… DIRECTO… 15.0 06-MA… <NA> * EARMAR… P2016
## # ... with 818,011 more rows
## # A tibble: 6 x 4
## cand_nm sum_election_tp_p2016 mean_election_tp n
## <chr> <dbl> <dbl> <int>
## 1 Clinton, Hillary Rodham 46434728 181 256294
## 2 Sanders, Bernard 19623823 48.2 407163
## 3 Cruz, Rafael Edward 'Ted' 5900103 105 56402
## 4 Rubio, Marco 4995681 376 13272
## 5 Trump, Donald J. 4439995 113 39164
## 6 Bush, Jeb 3317092 1085 3057
## cand_nm sum_election_tp_p2016 mean_election_tp
## Length:6 Min. : 3317092 Min. : 48.2
## Class :character 1st Qu.: 4578916 1st Qu.: 106.8
## Mode :character Median : 5447892 Median : 147.3
## Mean :14118570 Mean : 318.1
## 3rd Qu.:16192893 3rd Qu.: 327.6
## Max. :46434728 Max. :1085.1
## n
## Min. : 3057
## 1st Qu.: 19745
## Median : 47783
## Mean :129225
## 3rd Qu.:206321
## Max. :407163
Above is a graph of primary election contribution amounts by candidate. Hillary Clinton received the most in primary contributions, followed by Bernard Sanders. Averaging the six candidates’ mean contribution amounts gives $318.1, which is higher than the overall mean contribution of $116. This is an unweighted average across candidates; the mean over all P2016 contributions is $115, as shown in the election type table.
# Count the distinct contributor occupations
count_of_occupation <- length(unique(DF$contbr_occupation))
print(count_of_occupation)
## [1] 28616
We see that the contributors report 28,616 unique occupations. With so many occupations, it would be difficult to view the contribution spread across all of them, so let’s pick the top 10 occupation categories.
DF_occu_dist <- DF %>%
group_by(contbr_occupation) %>%
summarise(sum_occup = sum(contb_receipt_amt, na.rm=TRUE),
mean_occu = mean(contb_receipt_amt),
n = n())
DF_occu_dist <- head(arrange(DF_occu_dist,desc(sum_occup)), n = 10)
DF_occu_dist
## # A tibble: 10 x 4
## contbr_occupation sum_occup mean_occu n
## <chr> <dbl> <dbl> <int>
## 1 RETIRED 25158880 96.6 260546
## 2 ATTORNEY 8242015 225 36642
## 3 NOT EMPLOYED 6484973 56.1 115598
## 4 INFORMATION REQUESTED 5388307 174 30948
## 5 HOMEMAKER 4800790 278 17268
## 6 CEO 3421376 477 7174
## 7 PHYSICIAN 2644065 164 16111
## 8 CONSULTANT 2503911 179 13961
## 9 PRESIDENT 2354773 498 4728
## 10 LAWYER 2157174 241 8945
summary(DF_occu_dist)
## contbr_occupation sum_occup mean_occu n
## Length:10 Min. : 2157174 Min. : 56.1 Min. : 4728
## Class :character 1st Qu.: 2538950 1st Qu.:166.6 1st Qu.: 10199
## Mode :character Median : 4111083 Median :202.1 Median : 16690
## Mean : 6315626 Mean :238.9 Mean : 51192
## 3rd Qu.: 6210806 3rd Qu.:268.8 3rd Qu.: 35218
## Max. :25158880 Max. :498.0 Max. :260546
From the summary we see that the occupation categories “ATTORNEY”, “HOMEMAKER”, “CEO”, “PRESIDENT” and “LAWYER” have the highest mean contributions, each above $200 and well above the overall mean of $116. Only “RETIRED” and “NOT EMPLOYED” fall below the overall mean.
Among these ten categories, “PRESIDENT” has the highest mean contribution and “NOT EMPLOYED” has the lowest.
This graph is striking: it shows that the retired category contributed the most to the election funds.
Check city wise contribution, and then find out top 10 contributing cities. Below is graph for top 10 contributing cities.
# Count the distinct contributor cities
count_of_city <- length(unique(DF$contbr_city))
print(count_of_city)
## [1] 2534
DF_contbr_city <- DF %>%
group_by(contbr_city) %>%
summarise(sum_city = sum(contb_receipt_amt, na.rm=TRUE),
mean_city = mean(contb_receipt_amt),
n = n())
DF_contbr_city <- head(arrange(DF_contbr_city,desc(sum_city)), n = 10)
DF_contbr_city
## # A tibble: 10 x 4
## contbr_city sum_city mean_city n
## <chr> <dbl> <dbl> <int>
## 1 LOS ANGELES 16220656 158 102710
## 2 SAN FRANCISCO 15376476 169 90937
## 3 SAN DIEGO 3849797 83.5 46129
## 4 PALO ALTO 3261409 269 12105
## 5 OAKLAND 3150637 94.8 33235
## 6 BEVERLY HILLS 3125763 460 6796
## 7 BERKELEY 2863864 124 23150
## 8 SANTA MONICA 2854454 197 14495
## 9 SAN JOSE 2408418 78.5 30674
## 10 SACRAMENTO 2343078 98.5 23799
Los Angeles is the top contributing city, followed closely by San Francisco. The amount contributed from these cities is high, but is the count of contributions also high? Let’s see the number of contributions per candidate, occupation and city.
Group contribution amount per candidate, per occupation, per city.
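options(scipen = 999)  # print full numbers instead of scientific notation
DF_cand_occu_city_grp <- DF %>%
  # bare column names replace the deprecated .dots argument
  group_by(cand_nm, contbr_occupation, contbr_city) %>%
  summarise(sum_cand_occu_city = sum(contb_receipt_amt),
            n = n())
# Largest candidate/occupation/city totals first
DF_cand_occu_city_grp <- arrange(DF_cand_occu_city_grp, desc(sum_cand_occu_city))
DF_cand_occu_city_grp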
## # A tibble: 118,021 x 5
## # Groups: cand_nm, contbr_occupation [39,202]
## cand_nm contbr_occupation contbr_city sum_cand… n
## <chr> <chr> <chr> <dbl> <int>
## 1 Clinton, Hillary Rodham RETIRED SAN FRANCISCO 1172716 7570
## 2 Clinton, Hillary Rodham ATTORNEY LOS ANGELES 1138550 3948
## 3 Clinton, Hillary Rodham ATTORNEY SAN FRANCISCO 1045324 3687
## 4 Clinton, Hillary Rodham RETIRED LOS ANGELES 978573 8480
## 5 Clinton, Hillary Rodham WRITER LOS ANGELES 475835 2620
## 6 Clinton, Hillary Rodham RETIRED SAN DIEGO 460634 5875
## 7 Clinton, Hillary Rodham RETIRED BERKELEY 370685 2626
## 8 Clinton, Hillary Rodham RETIRED OAKLAND 369695 4469
## 9 Clinton, Hillary Rodham RETIRED SACRAMENTO 338157 4152
## 10 Clinton, Hillary Rodham HOMEMAKER LOS ANGELES 334893 789
## # ... with 118,011 more rows
## [1] "Clinton, Hillary Rodham" "Sanders, Bernard"
## [3] "Trump, Donald J." "Cruz, Rafael Edward 'Ted'"
## [5] "Rubio, Marco" "Bush, Jeb"
## [7] "Carson, Benjamin S." "Kasich, John R."
## [9] "Fiorina, Carly" "Paul, Rand"
## [1] "RETIRED" "ATTORNEY"
## [3] "NOT EMPLOYED" "INFORMATION REQUESTED"
## [5] "HOMEMAKER" "CEO"
## [7] "PHYSICIAN" "CONSULTANT"
## [9] "PRESIDENT" "LAWYER"
## [1] "LOS ANGELES" "SAN FRANCISCO" "SAN DIEGO" "PALO ALTO"
## [5] "OAKLAND" "BEVERLY HILLS" "BERKELEY" "SANTA MONICA"
## [9] "SAN JOSE" "SACRAMENTO"
In the above graph the occupation categories are ordered by the amount they donated. The most notable comparison is between “Attorney” and “Not Employed”: although more “Not Employed” people donated, their total contribution was smaller than that of the “Attorney” category.
On the contribution city graph the cities are likewise arranged in descending order of the amount contributed. The most notable point is that “Palo Alto” contributed a larger amount than “Oakland”, yet made far fewer contributions.
This may suggest that more financially well-off people live in “Palo Alto”.
with(DF, cor(contb_receipt_amt, rank(cand_nm)))
with(DF, cor(contb_receipt_amt, rank(contbr_occupation)))
with(DF, cor(contb_receipt_amt, rank(contbr_city)))
with(DF, cor(rank(cand_nm), rank(contbr_occupation)))
with(DF, cor(rank(cand_nm), rank(contbr_city)))
with(DF, cor(rank(contbr_occupation), rank(contbr_city)))
with(DF_cand_occu_city_grp, cor(sum_cand_occu_city, n))
There doesn’t seem to be much correlation between candidate name and contributor city, or between contributor city and occupation. Because rank() orders these character columns alphabetically, those coefficients are only a rough check. The one strong positive correlation is between the sum of the contribution amount and the number of contributions: the more contributions per candidate, per occupation and per city, the larger the total amount contributed.
The “Retired” category accounts for 20% of all contribution records (260,546 of 1,304,346), that is, about one fifth.
Let’s check the retired percentage for Los Angeles and San Francisco.
Roughly 10% of the contributions from Los Angeles are in the “Retired” category, which is half the statewide share.
Similarly, about 10% of the contributions from San Francisco are in the “Retired” category. Since both cities are below the statewide 20%, the “Retired” share must be considerably higher in other cities rather than concentrated in LA and SF.
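The share calculations above amount to taking the mean of a logical test, overall and per city. A minimal sketch on made-up data follows; the `retired_pct` column name is hypothetical, and the shares are computed over contribution rows, not over unique contributors.

```r
library(dplyr)

# Toy stand-in for DF; cities and occupations are made up.
toy <- tibble(
  contbr_city = c("LOS ANGELES", "LOS ANGELES", "LOS ANGELES", "LOS ANGELES",
                  "SAN FRANCISCO", "SAN FRANCISCO", "FRESNO", "FRESNO"),
  contbr_occupation = c("RETIRED", "ATTORNEY", "WRITER", "CEO",
                        "RETIRED", "ATTORNEY", "RETIRED", "RETIRED")
)

# Overall share of rows in the RETIRED category, in percent.
overall_pct <- 100 * mean(toy$contbr_occupation == "RETIRED")

# The same share computed separately for each city.
by_city <- toy %>%
  group_by(contbr_city) %>%
  summarise(retired_pct = 100 * mean(contbr_occupation == "RETIRED"), n = n())
print(overall_pct)
print(by_city)
```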
For the bivariate analysis I looked at how the total contribution amount is distributed with respect to the following factors.
The dataset has no political party column, so I added one: a Political_Party column filled with the party of each candidate. I used the website http://www.politifact.com/ to get information on the 25 unique candidates, and found that they belonged to 4 different political parties, namely the Democratic, Republican, Libertarian and Green parties.
Hillary Clinton received the most contributions, which is also reflected in the political party she represented: the Democratic Party received almost 75% of the total contributed amount. Retired people contributed the most. The top contributing cities were Los Angeles and San Francisco. Most of the contributions were made for the primary election.
Let’s plot a map diagram, to see location of the most contributing cities, on the California map.
## lon lat contbr_city sum_city mean_amt_per_city n
## 1 -118.2437 34.05223 LOS ANGELES 16220656 157.92675 102710
## 2 -122.4194 37.77493 SAN FRANCISCO 15376476 169.08933 90937
## 3 -117.1611 32.71574 SAN DIEGO 3849797 83.45719 46129
## 4 -122.1430 37.44188 PALO ALTO 3261409 269.42659 12105
## 5 -122.2711 37.80436 OAKLAND 3150637 94.79876 33235
## 6 -118.4004 34.07362 BEVERLY HILLS 3125763 459.94154 6796
## 7 -122.2585 37.87190 BERKELEY 2863864 123.70903 23150
## 8 NA NA SANTA MONICA 2854454 196.92678 14495
## 9 -121.8863 37.33821 SAN JOSE 2408418 78.51659 30674
## 10 -121.4944 38.58157 SACRAMENTO 2343078 98.45277 23799
## 11 -118.1445 34.14778 PASADENA 1649200 128.19274 12865
## 12 NA NA MENLO PARK 1520624 284.65444 5342
## 13 NA NA PACIFIC PALISADES 1490631 324.54418 4593
## 14 -117.9298 33.61888 NEWPORT BEACH 1484437 282.64230 5252
## 15 -119.6982 34.42083 SANTA BARBARA 1480510 123.38609 11999
## 16 -122.1141 37.38522 LOS ALTOS 1321927 300.91663 4393
## 17 -118.1937 33.77005 LONG BEACH 1184770 75.19005 15757
## 18 -117.8265 33.68457 IRVINE 1172455 138.00084 8496
## 19 -118.4514 34.14897 SHERMAN OAKS 1091782 142.75388 7648
## 20 -122.1977 37.46133 ATHERTON 1086700 878.49639 1237
In the above graph, the size of each orange dot indicates the number of contributions.
Let’s plot some heat maps to see the multivariate effect on contributions. We will look at contributions to candidates with respect to cities and occupations. So far we know the top contributing occupation overall; the heat maps will show the top contributing occupation per city.
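A heat map of this kind can be sketched by aggregating per candidate and city and mapping the count to the tile colour with `geom_tile()`. The example uses a made-up tibble; the colour scale, labels and the choice of `n` as the fill variable are assumptions, not the exact settings of the plots in this report.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for DF; values are made up.
toy <- tibble(
  cand_nm = c("Clinton, Hillary Rodham", "Clinton, Hillary Rodham",
              "Sanders, Bernard", "Sanders, Bernard", "Sanders, Bernard"),
  contbr_city = c("LOS ANGELES", "SAN FRANCISCO", "LOS ANGELES",
                  "LOS ANGELES", "SAN FRANCISCO"),
  contb_receipt_amt = c(200, 50, 15, 25, 40)
)

# Total amount and number of contributions per candidate and city.
cand_city <- toy %>%
  group_by(cand_nm, contbr_city) %>%
  summarise(sum_cand_city = sum(contb_receipt_amt), n = n(), .groups = "drop")

# One tile per candidate/city pair, shaded by the number of contributions.
heat_map <- ggplot(cand_city, aes(x = contbr_city, y = cand_nm, fill = n)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "darkorange") +
  labs(x = "Contributor city", y = "Candidate", fill = "Contributions")
```

Swapping `contbr_city` for `contbr_occupation` in both the grouping and the x aesthetic gives the candidate-by-occupation map in the same way.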
## # A tibble: 9,742 x 4
## # Groups: cand_nm [25]
## cand_nm contbr_city sum_cand_city n
## <chr> <chr> <dbl> <int>
## 1 Clinton, Hillary Rodham SAN FRANCISCO 12220950 56833
## 2 Clinton, Hillary Rodham LOS ANGELES 11997666 65166
## 3 Clinton, Hillary Rodham PALO ALTO 2603319 8344
## 4 Clinton, Hillary Rodham OAKLAND 2276655 19587
## 5 Clinton, Hillary Rodham BEVERLY HILLS 2190303 4576
## 6 Clinton, Hillary Rodham BERKELEY 2143883 12061
## 7 Clinton, Hillary Rodham SANTA MONICA 2105459 9031
## 8 Clinton, Hillary Rodham SAN DIEGO 2070972 24283
## 9 Sanders, Bernard SAN FRANCISCO 1814651 31078
## 10 Clinton, Hillary Rodham SACRAMENTO 1618197 14036
## # ... with 9,732 more rows
## # A tibble: 39,202 x 4
## # Groups: cand_nm [25]
## cand_nm contbr_occupation sum_cand_occu n
## <chr> <chr> <dbl> <int>
## 1 Clinton, Hillary Rodham RETIRED 14238969 161257
## 2 Clinton, Hillary Rodham ATTORNEY 6727956 27956
## 3 Sanders, Bernard NOT EMPLOYED 5286576 105655
## 4 Trump, Donald J. RETIRED 4449440 34257
## 5 Clinton, Hillary Rodham INFORMATION REQUESTED 3050159 13897
## 6 Clinton, Hillary Rodham HOMEMAKER 2877721 12148
## 7 Clinton, Hillary Rodham CEO 2150276 4111
## 8 Clinton, Hillary Rodham CONSULTANT 2015791 9402
## 9 Clinton, Hillary Rodham LAWYER 1801739 6917
## 10 Clinton, Hillary Rodham PHYSICIAN 1795537 9917
## # ... with 39,192 more rows
## # A tibble: 92,238 x 4
## # Groups: contbr_occupation [28,616]
## contbr_occupation contbr_city sum_cand_occu n
## <chr> <chr> <dbl> <int>
## 1 RETIRED SAN FRANCISCO 1476819 9096
## 2 ATTORNEY LOS ANGELES 1366596 4800
## 3 RETIRED LOS ANGELES 1350867 10387
## 4 ATTORNEY SAN FRANCISCO 1135922 4350
## 5 RETIRED SAN DIEGO 787443 9224
## 6 HOMEMAKER LOS ANGELES 547853 973
## 7 WRITER LOS ANGELES 541229 3391
## 8 NOT EMPLOYED SAN FRANCISCO 495329 5763
## 9 RETIRED SACRAMENTO 478660 5816
## 10 SOFTWARE ENGINEER SAN FRANCISCO 466443 3658
## # ... with 92,228 more rows
The above heatmaps show Hillary Clinton with a large number of contributions from Los Angeles and San Francisco. Only the top two candidates have more than 2,000 contributions from most of the cities; the rest have fewer than 2,000 contributions across all cities.
On the occupation map, too, Hillary Clinton received contributions across all occupations.
People in the “Not Employed” category contributed most to Bernard Sanders.
We know that Hillary Clinton raised the most money and had the most supporters in California. But was this true throughout the campaign? Looking at the above 2 graphs, we can notice a few things.
Hillary Clinton had the highest number of contributions throughout. The number of contributions for Bernard Sanders rose quite consistently. The number of contributions for Donald Trump fell towards the end of the campaign. Towards the end, only Bernard Sanders was in any competition with Hillary Clinton in terms of number of contributions.
This graph shows the count of contributions for each range of amount. A large number of people made small donations between 0 and 250 dollars. Many contributions were made for amounts of 500, 1,000 and 2,700 dollars. From the summary we saw that the mean contribution amount is 116 dollars. This can be seen on the graph.
Hillary Clinton was the top candidate in terms of contributions received. Her share of contributions was also the highest from the beginning of the primary election. This graph answers the question I had at the beginning of my exploration.
This was one of the most interesting graphs. Retired people contributed the most to the 2016 election. We also know that Los Angeles was the top contributing city. Does this mean that most retired people live in LA? This may not be a direct correlation, but it is something that can be explored. Another question worth exploring is whether Hillary Clinton received most of her contributions from the “Retired” category.
This was a large dataset with more than a million and a quarter observations, containing details about the contributions made to political candidates during the 2016 US presidential election.
I was most interested to see which political party received the most funds. There was no political party column in the given dataset. I first found the unique candidate names and then looked up their parties, so that I could finally draw the pie chart of party-wise funding. To see the trend of contributions as a time series, I added two columns, Month_Yr and yyyymm.
The dataset was not perfectly clean. I was getting a parsing error while creating the data frame. I found that there was an extra comma at the end of the last column, which I removed so that the CSV file parsed successfully. There were also 7 records where the zip code was not an integer, for example N4W2T. I found that this zip code belongs to Canada, not California, USA. Such values were replaced by ‘000000000’.
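The zip code replacement described above could be sketched in base R as follows. The sample values and the `raw_zip`, `is_numeric_zip` and `clean_zip` names are made up for illustration; the sketch assumes the zip codes are read as text first and that missing values should stay missing.

```r
# Made-up sample of zip codes read as text; "N4W2T" is the non-integer example.
raw_zip <- c("910162321", "N4W2T", "945981502", NA)

# TRUE where the value consists of digits only (grepl() returns FALSE for NA).
is_numeric_zip <- grepl("^[0-9]+$", raw_zip)

# Keep digit-only values and NAs; replace everything else with the placeholder.
clean_zip <- ifelse(is.na(raw_zip) | is_numeric_zip, raw_zip, "000000000")
print(clean_zip)
```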
The most difficult decision for me was how to handle the negative amount values. Initially I did not want to ignore them. But after the entire exploration I realize that ignoring those values was probably the better choice. The description shows that this is contribution money to be refunded. It may never have reached the candidate or party at all, and was marked as negative so that it would effectively be ignored.
For future exploration, I would like to see the number of contributions and their respective contributors for large contribution amounts, above a certain average.
I could see the total number of contributions per candidate, and would like to see the number of contributors per candidate, per city and per occupation in one graph. I couldn’t achieve more than 2 group-by variables in a single graph.
During the exploration I realized that most of the data in this dataset is categorical, with one continuous variable, “Contribution Amount”. Most of the other variables, such as candidate name, contributor occupation, contributor city and election type, are discrete. So I have mostly plotted bar charts rather than scatter plots or line charts.